Let's explore principal component analysis for the abalone data set.
This notebook draws on two other notebooks as sources.
%matplotlib inline
# numbers
import numpy as np
import pandas as pd
# stats
import statsmodels.api as sm
import scipy.stats as stats
from sklearn.preprocessing import StandardScaler
# plots
import matplotlib.pyplot as plt
import seaborn as sns
# utils
import os, re
from pprint import pprint
Start with a method to extract data from a CSV file:
# Load data, filter outliers (zero height/length)
def abalone_load(data_file, infant=False, allofit=False):
    # x data labels
    xnlabs = ['Sex']
    xqlabs = ['Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight']
    xlabs = xnlabs + xqlabs
    # y data labels
    ylabs = ['Rings']
    # data
    new_df = pd.read_csv(data_file, header=None, sep=' ', names=xlabs+ylabs)
    # ---------------------
    new_df['Volume'] = new_df['Length']*new_df['Diameter']*new_df['Height']
    new_df['Volume (Normalized)'] = new_df['Volume']/new_df['Volume'].mean()
    new_df['Volume (Log)'] = new_df['Volume'].apply( lambda x : np.log(x) )
    new_df['Whole weight (Normalized)'] = new_df['Whole weight']/new_df['Whole weight'].mean()
    new_df['Shell weight (Normalized)'] = new_df['Shell weight']/new_df['Shell weight'].mean()
    new_df['Shucked weight (Normalized)'] = new_df['Shucked weight']/new_df['Shucked weight'].mean()
    new_df['Viscera weight (Normalized)'] = new_df['Viscera weight']/new_df['Viscera weight'].mean()
    new_df['Rings (Log)'] = new_df['Rings'].apply( lambda x : np.log(x) )
    # ---------------------
    if(infant):
        return new_df[ new_df['Sex']=='I' ]
    elif(allofit):
        return new_df
    else:
        return new_df[ new_df['Sex']!='I' ]
def adult_abalone_load(data_file):
    return abalone_load(data_file,infant=False,allofit=False)

def infant_abalone_load(data_file):
    return abalone_load(data_file,infant=True,allofit=False)
def abalone_removeoutliers(df,verbose=False):
    len1 = len(df)
    df = df[ df['Height'] < 0.30 ]
    df = df[ df['Height'] > 0.001 ]
    df = df[ df['Length'] > 0.001 ]
    df = df[ df['Diameter'] > 0.001 ]
    len2 = len(df)
    if(verbose):
        print("Removed", (len1-len2), "outliers")
    return df
# x data labels
xnlabs = ['Sex']
xqlabs = ['Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight']
xlabs = xnlabs + xqlabs
# y data labels
ylabs = ['Rings']
#adt_df = abalone_removeoutliers(adult_abalone_load('abalone/Dataset.data'),verbose=True)
abalone_df = abalone_removeoutliers(abalone_load('abalone/Dataset.data',allofit=True))
The data consists of multiple columns, each a measurement that was made on the abalone, together with the variable that we are trying to predict, "Rings" (which is an indication of the age of the abalone).
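As a quick orientation (a small, optional check using the abalone_df frame loaded above), we can look at how many rows survived outlier removal and which sex categories are present:
# Rows/columns remaining after outlier filtering, and the sex categories in the data
print(abalone_df.shape)
print(abalone_df['Sex'].unique())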
abalone_df.describe()
abalone_df.columns
To begin, we can plot the correlation matrix of our dataframe, picking out some particular response variables and input variables:
slim = abalone_df[[u'Rings',u'Rings (Log)',
u'Volume',
u'Whole weight', u'Shell weight',
u'Shucked weight', u'Viscera weight']]
correlation = slim.corr()
plt.figure(figsize=(10,10))
sns.heatmap(correlation, square=True,annot=True,vmin=0,vmax=1,cmap='jet')
plt.title('Correlation Matrix (Unscaled Variables)')
plt.show()
We can already see the challenge inherent in this data set: the input variables (consisting of three physical measurements, length, height, and diameter, as well as several weight measurements) are all closely correlated with one another, but none is particularly well-correlated with the system response.
cols = slim.columns.tolist()[1:]
slim = slim.reindex(columns=cols)
X = slim.iloc[:,1:6].values
y = slim.iloc[:,0].values
#print X
#print "-"*20
#print y
# -----------------
bigcols = slim.columns.tolist()
bigX = slim.iloc[:,:].values
print(bigX.shape)
To perform principal component analysis, we strip the values out of the dataframe and put them into NumPy arrays. The input variables are in X, the system response is in y, and the matrix bigX contains both the input and response variables.
def get_normed_mean_cov(X):
    X_std = StandardScaler().fit_transform(X)
    X_mean = np.mean(X_std, axis=0)
    #X_cov = np.cov(X_std.T)
    X_cov = (X_std - X_mean).T.dot((X_std - X_mean)) / (X_std.shape[0]-1)
    return X_std, X_mean, X_cov
X_std, X_mean, X_cov = get_normed_mean_cov(X)
bigX_std, bigX_mean, bigX_cov = get_normed_mean_cov(bigX)
We define a method to normalize X so that each column has a mean of 0 and a variance of 1; the result is stored in X_std. We also compute the mean and covariance of the standardized variables and store them in X_mean and X_cov, respectively.
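As a quick sanity check (a small sketch using the X_std array computed above), each standardized column should have a mean of roughly 0 and a standard deviation of roughly 1:
# Column means should be ~0 and column standard deviations ~1 after StandardScaler
print(X_std.mean(axis=0).round(6))
print(X_std.std(axis=0).round(6))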
print("Covariance matrix")
print(X_cov)
We print the covariance matrix of the standardized inputs, then plot the version that also includes the response variables:
plt.figure(figsize=(8,8))
sns.heatmap(pd.DataFrame(bigX_cov,columns=bigcols),
vmax=1, vmin=0,
square=True,annot=True,cmap='jet')
plt.title('Correlation between different features (scaled)')
The strongest correlation with the response variable is between the shell weight and the log of the number of rings. Even so, none of the input variables is particularly well-correlated with the system response, so PCA probably won't help us much here.
eigenvals, eigenvecs = np.linalg.eig(X_cov)
print("Eigenvals:\n", eigenvals)
print("Eigenvecs:\n", eigenvecs)
We compute the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors define the principal components, and the eigenvalues tell us how much of the variance in the data each principal component explains.
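As a quick illustration (a small sketch using the X_std, eigenvals, and eigenvecs arrays from above), the scores along the first principal component are just the projection of the standardized data onto the eigenvector with the largest eigenvalue:
# Order eigenvectors by decreasing eigenvalue, then project onto the leading one
order = np.argsort(eigenvals)[::-1]
pc1_scores = X_std.dot(eigenvecs[:, order[0]])
print(pc1_scores[:5])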
tot = sum(eigenvals)
explained_variance = [(i/tot) for i in sorted(eigenvals, reverse=True)]
from pprint import pprint
eigenvalvec = [(np.abs(eigenvals[i]), eigenvecs[:,i]) for i in range(len(eigenvals))]
eigenvalvec.sort(key=lambda pair: pair[0], reverse=True)
pprint([pair[0] for pair in eigenvalvec])
The list above is the sorted list of eigenvalues, one for each eigenvector. By comparing the relative values of these eigenvalues, we can determine how important each principal component is:
plt.figure(figsize=(6, 4))
plt.bar(range(5), explained_variance, alpha=0.5, align='center',
label='individual explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()
We can also plot the fraction of the variance explained by each principal component. In this case, we have one principal component that explains nearly all (95%) of the variance. The reason is not that we have discovered a very convenient principal component; the reason is that this component is a mixture of all of the input variables in nearly equal proportions.
In other words, we can't eliminate any variables.
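To see this directly (a small sketch using the eigenvalvec list built above), print the weights in the eigenvector belonging to the largest eigenvalue; they are all of roughly the same magnitude:
# The eigenvector paired with the largest eigenvalue weights every input variable almost equally
lead_eigenval, lead_eigenvec = max(eigenvalvec, key=lambda pair: pair[0])
print(lead_eigenvec)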
from sklearn.decomposition import PCA
pca = PCA().fit(X_std)
S_pca = pca.transform(X_std)
fig = plt.figure(figsize=(10,4))
ax1, ax2 = fig.add_subplot(121), fig.add_subplot(122)
# These two curves (manual eigendecomposition vs. sklearn PCA) should lie exactly on top of each other
ax1.plot(np.cumsum(explained_variance))
ax1.plot(np.cumsum(pca.explained_variance_ratio_))
ax1.set_xlim(0,5)
ax1.set_xlabel('Number of Principal Components')
ax1.set_ylabel('Cumulative explained variance')
ax1.set_title('Explained Variance')
ax2.plot(np.cumsum(explained_variance))
ax2.plot(np.cumsum(pca.explained_variance_ratio_))
ax2.set_xlim(0,5)
ax2.set_ylim([0,1.1])
ax2.set_xlabel('Number of Principal Components')
ax2.set_ylabel('Cumulative explained variance')
ax2.set_title('Explained Variance')
plt.show()
If we limit the number of principal components to 3, we get the same result: there is effectively only one principal component that matters, and it weights every variable nearly equally.
from sklearn.decomposition import PCA
sklearn_pca = PCA(n_components=3).fit(X_std)
#Y_sklearn = sklearn_pca.fit_transform(X_std)
print(sklearn_pca.components_)
print(np.linalg.norm(sklearn_pca.components_[0],2))
Examine the top row, the first principal component. This vector shows the weight of each variable in that component. We found a single principal component that accounts for 95% of the variance, but only because it does nothing to reduce the number of variables: it assigns virtually identical weights of 0.44 or 0.45 to every single variable.
Examining infant and adult datasets separately shows that these two data frames are different: infant abalones have a stronger correlation between age and shell weight (and other input variables) than do adult abalones.
adt_df = abalone_removeoutliers(adult_abalone_load('abalone/Dataset.data'))
inf_df = abalone_removeoutliers(infant_abalone_load('abalone/Dataset.data'))
adt_df.columns
def get_normed_mean_cov(X):
    X_std = StandardScaler().fit_transform(X)
    X_mean = np.mean(X_std, axis=0)
    #X_cov = np.cov(X_std.T)
    X_cov = (X_std - X_mean).T.dot((X_std - X_mean)) / (X_std.shape[0]-1)
    return X_std, X_mean, X_cov
def cov_pca_plots(df,labels,title=''):
    slim = df[labels]
    slimcols = labels[1:]
    X = slim.iloc[:, 1:].values
    Y = slim.iloc[:, 0].values

    bigcols = labels
    bigX = slim.iloc[:,:].values

    X_std, X_mean, X_cov = get_normed_mean_cov(X)
    bigX_std, bigX_mean, bigX_cov = get_normed_mean_cov(bigX)

    eigenvals, eigenvecs = np.linalg.eig(X_cov)
    tot = sum(eigenvals)
    explained_variance = [(i/tot) for i in sorted(eigenvals, reverse=True)]
    eigenvalvec = [(np.abs(eigenvals[i]), eigenvecs[:,i]) for i in range(len(eigenvals))]
    pprint([pair[0] for pair in eigenvalvec])

    sns.heatmap(pd.DataFrame(bigX_cov,columns=bigcols),
                vmax=1, vmin=0,
                square=True,annot=True,cmap='jet')
    plt.title(title+' Covariance Plot')

    print("-"*20)

    pca = PCA().fit(X_std)
    S_pca = pca.transform(X_std)
    print("Top 4 Principal Components:")
    print(pca.components_[:4])

    fig = plt.figure(figsize=(10,4))
    ax1, ax2 = fig.add_subplot(121), fig.add_subplot(122)

    # These two curves (manual eigendecomposition vs. sklearn PCA) should lie exactly on top of each other
    ax1.plot(np.cumsum(explained_variance))
    ax1.plot(np.cumsum(pca.explained_variance_ratio_))
    ax1.set_xlim(0,5)
    ax1.set_xlabel('Number of Principal Components')
    ax1.set_ylabel('Cumulative explained variance')
    ax1.set_title('Explained Variance')

    ax2.plot(np.cumsum(explained_variance))
    ax2.plot(np.cumsum(pca.explained_variance_ratio_))
    ax2.set_xlim(0,5)
    ax2.set_ylim([0,1.1])
    ax2.set_xlabel('Number of Principal Components')
    ax2.set_ylabel('Cumulative explained variance')
    ax2.set_title('Explained Variance')

    plt.show()
Here is a plot of the covariance matrix for infant abalones. Note that the correlation between age (log of number of rings) and all other input variables is stronger for the infant abalones than for adult abalones:
labels = ['Rings (Log)',
'Volume',
'Whole weight',
'Shell weight',
'Shucked weight',
'Viscera weight']
cov_pca_plots(inf_df,labels,'Infant Abalone')
labels = ['Rings (Log)',
'Volume',
'Whole weight',
'Shell weight',
'Shucked weight',
'Viscera weight']
cov_pca_plots(adt_df,labels,'Adult Abalone')
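As a quick numeric check on the claim that the age correlations are stronger for infant abalones (a small sketch using the inf_df and adt_df frames loaded above), we can compare the correlation between the log of the number of rings and shell weight for the two groups:
# Compare the age/shell-weight correlation for infant vs. adult abalones
for name, group_df in [('Infant', inf_df), ('Adult', adt_df)]:
    print(name, group_df['Rings (Log)'].corr(group_df['Shell weight']))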
We can also explore the effect of adding additional variables. However, it's important to keep in mind that if we only have length and weight information, any derived variables will themselves be length or weight variables, or encoded versions thereof.
def add_df_columns(df):
    df['Shell-Whole Weight Ratio'] = df['Shell weight']/df['Whole weight']
    df['Shell-Shucked Weight Ratio'] = df['Shell weight']/df['Shucked weight']
    df['Shell-Viscera Weight Ratio'] = df['Shell weight']/df['Viscera weight']
    df['Specific Volume'] = df['Volume']/df['Whole weight']
    df['Weight-Length'] = df['Whole weight']/df['Volume'].apply(lambda x : pow(x,1./3.))
    df['Growth Rate'] = df['Whole weight']/df['Rings']
    return df
adt_df = add_df_columns(adt_df)
inf_df = add_df_columns(inf_df)
labels = ['Rings',
'Volume',
'Whole weight',
'Shell weight',
'Shucked weight',
'Viscera weight',
'Weight-Length',
'Growth Rate']
cov_pca_plots(inf_df,labels,'Infant Abalone')
labels = ['Rings (Log)',
'Volume',
'Whole weight', 'Shell weight',
'Shucked weight', 'Viscera weight',
'Weight-Length']
cov_pca_plots(adt_df,labels,'Adult Abalone')
From the above, we see the same pattern: there is one principal component that accounts for over 90% of the variance, and that principal component is an evenly-weighted mixture of all the input variables.